Final Project - ERASMUS+ Mobility Program

Abstract

Erasmus is a program of the European Union that allows students to study at a foreign university through an exchange. Every student weighs many factors when choosing a destination. In this final project we used a dataset provided by the European Union to explore whether there is any relationship between student characteristics such as age, gender and nationality and their choice of Erasmus destination. We also provide several plots to give a better understanding of the dataset. Finally, we created a machine learning model that predicts a student's destination.

Exploratory Data Analysis

The dataset we used covers the 2012-2013 academic year and can be found here. It is published directly by the European Union. It was created from the statistical reports of the national agencies of the 33 countries participating in the Erasmus+ program (Erasmus decentralised actions) and from data provided by the Education, Audiovisual and Culture Executive Agency (Erasmus centralised actions). The data is generated during each student's application process and then collected by the respective universities. It contains 267547 observations and 34 variables.
The host institution country is one of the most interesting variables for us. It has a lot of undefined values, around 55,000, so we need to filter those out. For both host and home country, values are stored as country codes. However, Belgium is coded as three different values, “BEDE”, “BEFR” and “BENL”, depending on the language area (German, French or Dutch, respectively). We merge all of these values into a single code for the whole of Belgium.
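The filtering and merging steps described above can be sketched as follows. This is a minimal illustration on toy data, not the code used in the project: the marker "???" for undefined values, the column name HOME_COUNTRY and the helper merge_belgium() are assumptions made for the example, and "BE" is used as the merged code because that is the level that appears in the regression output later in the report.

```r
# Toy data; HOME_COUNTRY and the "???" marker are hypothetical stand-ins
erasmus <- data.frame(
  HOME_COUNTRY = c("BEDE", "ES", "BENL", "FR", "DE"),
  HOST_INSTITUTION_COUNTRY_CDE = c("ES", "???", "UK", "BEFR", "BENL"),
  stringsAsFactors = FALSE
)

# Drop rows whose host country is undefined
filtered <- erasmus[erasmus$HOST_INSTITUTION_COUNTRY_CDE != "???", ]

# Hypothetical helper: map the three Belgian language-area codes to "BE"
merge_belgium <- function(codes) {
  codes[codes %in% c("BEDE", "BEFR", "BENL")] <- "BE"
  codes
}

filtered$HOME_COUNTRY <- merge_belgium(filtered$HOME_COUNTRY)
filtered$HOST_INSTITUTION_COUNTRY_CDE <-
  merge_belgium(filtered$HOST_INSTITUTION_COUNTRY_CDE)
```

If the columns are stored as factors, converting them with as.character() before merging and back with factor() afterwards avoids leftover empty levels.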
The dataset has 34 variables and we do not use all of them, so we list the ones most relevant to our research:

The first thing we wanted to explore was whether there is a difference between the numbers of male and female students enrolled in Erasmus. We expected a significant difference, as one of the cited papers suggests that there is a gender gap. We present a pie chart here to check this assumption.

Next we wanted to see which countries send the most students on Erasmus. Rather than just listing them, we present this metric on a map of Europe, coloring each country according to the number of students whose home university is in that country. Spain, France and Germany lead in the number of students enrolled in Erasmus. Surprisingly, Turkey also ranks very high.

##  [1] ES   DE   FR   IT   PL   TR   UK   NL   CZ   PT   FI   RO   AT   HU  
## [15] BENL GR   LT   SE   DK   SK   BEFR LV   CH   BG   EE   NO   HR   LU  
## [29] CY   LI   MT   BEDE IE   SI   IS  
## 35 Levels: AT BEDE BEFR BENL BG CH CY CZ DE DK EE ES FI FR GR HR HU ... UK

We were also interested in the subject areas in which Erasmus is most popular. The dataset contains a code for each area, and we used the International Standard Classification of Education to map those codes to area names. We also merged areas whose codes start with the same two digits, since those are related, and displayed the result as a bar plot.
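Merging areas by their first two digits can be sketched with substr(). The raw codes below are hypothetical examples, and the lookup table is only an illustrative two-entry excerpt of the classification, not the full mapping used in the project.

```r
# Hypothetical raw subject-area codes
raw_codes <- c("481", "482", "340", "14", "1")

# Keep the first two characters so that related areas share one code
merged_codes <- substr(raw_codes, 1, 2)

# Illustrative excerpt of a code-to-name lookup (only two fields shown)
isced_names <- c("34" = "Business and administration", "48" = "Computing")
area_names <- unname(isced_names[merged_codes])  # NA where no name is listed

# Counts per merged area, which is the input a bar plot needs
area_counts <- table(merged_codes)
```

Codes shorter than two characters, such as "1", are left unchanged by substr(), which matches the one-digit levels that appear in the regression output.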

To explore the data further, we looked at the age distribution. At that point we noticed that one student attended Erasmus at the age of 93. There were other unusual records as well, such as students aged 73 and 69. We therefore present the distribution only up to the age of 30, which covers most of the students. The most frequent age was 22 among male students and 21 among female students. The plot also shows that there are more female than male students in nearly every age group.
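A small sketch of how the most frequent age per gender can be obtained once the outliers are cut off at 30. The data frame below is invented for illustration; only the column names are taken from the dataset.

```r
# Toy records, including one implausible age
students <- data.frame(
  STUDENT_AGE_VALUE = c(21, 22, 22, 21, 21, 93, 23, 22),
  STUDENT_GENDER_CDE = c("F", "M", "M", "F", "F", "M", "F", "M")
)

# Restrict to students aged 30 or younger
young <- students[students$STUDENT_AGE_VALUE <= 30, ]

# Cross-tabulate gender (rows) against age (columns)
counts <- table(young$STUDENT_GENDER_CDE, young$STUDENT_AGE_VALUE)

# Most frequent age per gender
modal_age <- apply(counts, 1, function(row) as.numeric(names(which.max(row))))
```

The same counts table can be passed to barplot(counts, beside = TRUE) to draw the distribution side by side for both genders.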

The last thing we wanted to explore was the 10 most popular host universities in Europe. A simple bar plot shows these universities and the number of ERASMUS students enrolled at each. Sweden leads with universities in Stockholm and Linköping, while third place belongs to a university in Valencia.

Methods

Strength of relationships

We want host country to be our outcome variable and to see how the other variables relate to it. There are 34 variables, but not all of them make sense to include in the model. After exploring the dataset we decided that only a few are needed. Here is the formula of our model:

HOST_INSTITUTION_COUNTRY_CDE ~ STUDENT_NATIONALITY_CDE + STUDENT_AGE_VALUE + STUDENT_SUBJECT_AREA_VALUE + STUDENT_GENDER_CDE

We also explain why we included each variable:

  • STUDENT_NATIONALITY_CDE - students from the same country probably tend to choose similar destinations for their Erasmus, considering distance, cost of living etc.
  • STUDENT_AGE_VALUE - we think that older students may choose better universities, while younger ones are more interested in different cultures and lifestyles
  • STUDENT_SUBJECT_AREA_VALUE - some countries have universities that are popular for particular study areas, for example Scandinavian countries have great universities for computing
  • STUDENT_GENDER_CDE - we assumed that there might be differences between males and females in lifestyle and in what they want, so we included this variable

The first thing that comes to mind when talking about strength of relationships is a linear model, fitted with the function lm(). However, our outcome is categorical rather than numeric, so a linear model is not appropriate. The next option is logistic regression, which models a categorical outcome. The only problem is that logistic regression usually assumes a binary outcome, whereas we have multiple classes. Since host country is our dependent variable, we have as many classes as there are countries in that column; for the 2012-2013 dataset that is 33. This is where a multinomial model, which handles any number of classes, comes in handy. We use the multinom() function from the nnet package, specifying the data, the formula, the maximum number of weights and the maximum number of iterations. We created the model in R with the following command:

model <- multinom(formula = HOST_INSTITUTION_COUNTRY_CDE ~ STUDENT_NATIONALITY_CDE + STUDENT_AGE_VALUE + STUDENT_SUBJECT_AREA_VALUE + STUDENT_GENDER_CDE, data = filtered, MaxNWts=3000, maxit = 20)

We set the model to allow at most 3000 weights (MaxNWts) and at most 20 iterations (maxit).
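To show what such a fit returns, here is a minimal sketch on simulated data. The data frame toy and its columns are invented for illustration and carry no real signal; only the multinom() arguments mirror the command above, with trace = FALSE added to silence the iteration log.

```r
library(nnet)  # recommended package, normally installed together with R

set.seed(1)
n <- 300
toy <- data.frame(
  nationality = factor(sample(c("DE", "ES", "FR"), n, replace = TRUE)),
  age = sample(19:28, n, replace = TRUE),
  gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  host = factor(sample(c("IT", "PL", "SE", "UK"), n, replace = TRUE))
)

# Same arguments as in the report, applied to the toy data
fit <- multinom(host ~ nationality + age + gender, data = toy,
                MaxNWts = 3000, maxit = 20, trace = FALSE)

# One row of coefficients per non-reference host country
coef_matrix <- coef(fit)

# Predicted host country for the first five toy students
predicted <- predict(fit, newdata = toy[1:5, ], type = "class")
```

With four host countries and five model terms the coefficient matrix has three rows and five columns. With 33 host countries and dozens of dummy-coded predictors it becomes far larger, which is consistent with the need to raise MaxNWts.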
Even though we managed to fit this model, computing its summary did not finish in a reasonable time, so we had to take another approach. For this reason alone we reduced the dataset to two outcome categories, UK and ES, and built a model with only those two classes. Since the outcome is now binary, we can apply a logistic regression model. The model is created with the following command:

model <- glm(formula = HOST_INSTITUTION_COUNTRY_CDE ~ STUDENT_NATIONALITY_CDE + STUDENT_AGE_VALUE + STUDENT_SUBJECT_AREA_VALUE + STUDENT_GENDER_CDE, data = filtered, family = binomial())

This computation is much faster, so we can explore the strength of relationships properly.
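When glm() receives a factor response with family = binomial(), R treats the first factor level as "failure" and every other level as "success", so the sign of each coefficient depends on the level order. The report does not state how the levels were ordered, so the sketch below, on simulated data with invented columns, simply shows one way to make the coding explicit.

```r
set.seed(42)
n <- 500
toy <- data.frame(
  gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  age = sample(19:28, n, replace = TRUE),
  host = sample(c("ES", "UK"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Explicit coding: "ES" is the reference (failure), "UK" is success
toy$host <- factor(toy$host, levels = c("ES", "UK"))

fit <- glm(host ~ gender + age, data = toy, family = binomial())

# Fitted values are estimated probabilities of the second level, "UK"
p_uk <- predict(fit, type = "response")

# Exponentiated coefficients are odds ratios for going to "UK" rather than "ES"
odds_ratios <- exp(coef(fit))
```

Calling droplevels() on the filtered data before fitting serves a similar purpose, because unused country levels left over from the full dataset would otherwise change which level counts as the reference.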

Prediction

We decided to use model trees because linear regression did not give satisfactory results (R squared was 0.02).

Model trees are grown in much the same way as regression trees, but at each leaf, a multiple linear regression model is built from the examples reaching that node.

Exploring and preparing the data

Two of the attributes used are categorical: STUDENT_NATIONALITY_CDE and STUDENT_SUBJECT_AREA_VALUE. Some machine learning algorithms available in R create the dummy variables themselves, but the one we use does not, so we had to one-hot encode these two attributes.
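One way to one-hot encode factor columns in base R is model.matrix() with the intercept removed, which yields one indicator column per level. The sketch below uses a toy data frame. The helper one_hot() and the prefixes "NAT" and "AREA" are names chosen for this example, and the result df_encoded only plays the role that df_reg has in the report.

```r
students <- data.frame(
  STUDENT_NATIONALITY_CDE = factor(c("ES", "DE", "FR", "ES")),
  STUDENT_SUBJECT_AREA_VALUE = factor(c("31", "48", "31", "22")),
  STUDENT_AGE_VALUE = c(21, 22, 23, 21)
)

# Hypothetical helper: one 0/1 column per level of a factor
one_hot <- function(column, prefix) {
  indicators <- model.matrix(~ column - 1)  # "- 1" keeps every level
  colnames(indicators) <- paste0(prefix, "_", levels(column))
  indicators
}

# Numeric attribute plus the indicator columns of both factors
df_encoded <- data.frame(
  STUDENT_AGE_VALUE = students$STUDENT_AGE_VALUE,
  one_hot(students$STUDENT_NATIONALITY_CDE, "NAT"),
  one_hot(students$STUDENT_SUBJECT_AREA_VALUE, "AREA")
)
```

Each row has exactly one 1 among the NAT_ columns and one among the AREA_ columns.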

Model trees

A model tree improves on a regression tree by replacing the leaf nodes with regression models. This often gives more accurate results than regression trees, which use only a single value for prediction at each leaf node. In R, regression trees are provided by the rpart package.

We used the M5’ algorithm (M5-prime) by Wang and Witten, an enhancement of the original M5 model tree algorithm proposed by Quinlan in 1992.

Sampling and evaluation

We decided to train the model on 80% of the dataset and use the remaining 20% as test data. The trained model is then used to predict the host institution country for the test data. As a performance measure we use the mean absolute error, which is the average absolute difference between our predictions and the actual values. In addition, we report the accuracy of our predictions.

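library(RWeka)  # provides M5P()

# Split: a random 80% of the rows for training, the remaining 20% for testing
dt <- sort(sample(nrow(df_reg), floor(nrow(df_reg) * 0.8)))
train <- df_reg[dt, ]
test <- df_reg[-dt, ]

# Train
m.m5p <- M5P(HOST_INSTITUTION_COUNTRY_CDE ~ ., data = train)

# Predict
p.m5p <- predict(m.m5p, test)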
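The two evaluation measures can be sketched as follows. The vectors are invented numbers, and computing accuracy by rounding each numeric prediction to the nearest country code is an assumption about how the accuracy figure could be obtained, since the report does not show that step.

```r
# Hypothetical numeric country codes and model predictions
actual <- c(3, 10, 7, 21)
predicted <- c(4.2, 9.6, 15.1, 20.8)

# Mean absolute error: average absolute difference
mae <- mean(abs(predicted - actual))

# Accuracy (assumed definition): share of predictions that round to the true code
accuracy <- mean(round(predicted) == actual)
```

Because a numeric coding imposes an arbitrary order and distance on the countries, a small absolute error has no natural interpretation here, which is worth keeping in mind when reading the results.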

Results

Strength of relationships

The summary of our logistic regression model is presented below:

Call:
glm(formula = HOST_INSTITUTION_COUNTRY_CDE ~ STUDENT_NATIONALITY_CDE + 
    STUDENT_AGE_VALUE + STUDENT_SUBJECT_AREA_VALUE + STUDENT_GENDER_CDE, 
    family = binomial(), data = filtered)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.8644  -0.8695  -0.5691   1.0946   3.1778  

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -0.065424   0.559370  -0.117 0.906892    
STUDENT_NATIONALITY_CDEBE     -0.510193   0.089534  -5.698 1.21e-08 ***
STUDENT_NATIONALITY_CDEBG     -0.201445   0.149165  -1.350 0.176862    
STUDENT_NATIONALITY_CDECH      0.195356   0.115699   1.688 0.091320 .  
STUDENT_NATIONALITY_CDECY     -0.082228   0.242261  -0.339 0.734295    
STUDENT_NATIONALITY_CDECZ      0.298433   0.097206   3.070 0.002140 ** 
STUDENT_NATIONALITY_CDEDE     -0.015585   0.073135  -0.213 0.831253    
STUDENT_NATIONALITY_CDEDK      1.006396   0.107310   9.378  < 2e-16 ***
STUDENT_NATIONALITY_CDEEE      0.249332   0.197660   1.261 0.207159    
STUDENT_NATIONALITY_CDEES      4.032322   0.126069  31.985  < 2e-16 ***
STUDENT_NATIONALITY_CDEFI      0.720312   0.096938   7.431 1.08e-13 ***
STUDENT_NATIONALITY_CDEFR      0.380214   0.073581   5.167 2.38e-07 ***
STUDENT_NATIONALITY_CDEGR     -0.546220   0.120104  -4.548 5.42e-06 ***
STUDENT_NATIONALITY_CDEHR     -0.792309   0.251065  -3.156 0.001601 ** 
STUDENT_NATIONALITY_CDEHU     -0.139422   0.127126  -1.097 0.272765    
STUDENT_NATIONALITY_CDEIE     -0.780883   0.133951  -5.830 5.55e-09 ***
STUDENT_NATIONALITY_CDEIS      0.167390   0.253194   0.661 0.508539    
STUDENT_NATIONALITY_CDEIT     -0.972142   0.075588 -12.861  < 2e-16 ***
STUDENT_NATIONALITY_CDELI    -10.305356  84.438362  -0.122 0.902863    
STUDENT_NATIONALITY_CDELT     -0.453280   0.153101  -2.961 0.003070 ** 
STUDENT_NATIONALITY_CDELU     -0.280322   0.391914  -0.715 0.474446    
STUDENT_NATIONALITY_CDELV     -1.072289   0.245016  -4.376 1.21e-05 ***
STUDENT_NATIONALITY_CDEMT      2.796590   0.416106   6.721 1.81e-11 ***
STUDENT_NATIONALITY_CDENL      0.574346   0.086693   6.625 3.47e-11 ***
STUDENT_NATIONALITY_CDENO      1.036971   0.125072   8.291  < 2e-16 ***
STUDENT_NATIONALITY_CDEPL     -0.978837   0.087812 -11.147  < 2e-16 ***
STUDENT_NATIONALITY_CDEPT     -1.108624   0.106818 -10.379  < 2e-16 ***
STUDENT_NATIONALITY_CDERO     -0.956060   0.133953  -7.137 9.52e-13 ***
STUDENT_NATIONALITY_CDESE      1.009450   0.097950  10.306  < 2e-16 ***
STUDENT_NATIONALITY_CDESI     -0.754328   0.171553  -4.397 1.10e-05 ***
STUDENT_NATIONALITY_CDESK     -0.491452   0.135884  -3.617 0.000298 ***
STUDENT_NATIONALITY_CDETR     -0.847052   0.104855  -8.078 6.57e-16 ***
STUDENT_NATIONALITY_CDEUK     -4.049174   0.226146 -17.905  < 2e-16 ***
STUDENT_AGE_VALUE             -0.004137   0.005066  -0.817 0.414192    
STUDENT_SUBJECT_AREA_VALUE1   -0.915442   0.611400  -1.497 0.134318    
STUDENT_SUBJECT_AREA_VALUE10  -0.049669   0.705469  -0.070 0.943871    
STUDENT_SUBJECT_AREA_VALUE14  -0.312324   0.546546  -0.571 0.567695    
STUDENT_SUBJECT_AREA_VALUE2    0.337524   0.714199   0.473 0.636505    
STUDENT_SUBJECT_AREA_VALUE21   0.076313   0.545027   0.140 0.888647    
STUDENT_SUBJECT_AREA_VALUE22  -0.175862   0.543294  -0.324 0.746168    
STUDENT_SUBJECT_AREA_VALUE3    0.169626   0.553877   0.306 0.759412    
STUDENT_SUBJECT_AREA_VALUE31  -0.609875   0.543833  -1.121 0.262101    
STUDENT_SUBJECT_AREA_VALUE32  -0.872924   0.547677  -1.594 0.110966    
STUDENT_SUBJECT_AREA_VALUE34  -0.735747   0.543484  -1.354 0.175813    
STUDENT_SUBJECT_AREA_VALUE38  -0.155832   0.544337  -0.286 0.774665    
STUDENT_SUBJECT_AREA_VALUE4    0.179354   0.756269   0.237 0.812535    
STUDENT_SUBJECT_AREA_VALUE42  -0.036296   0.548599  -0.066 0.947250    
STUDENT_SUBJECT_AREA_VALUE44   0.010724   0.546164   0.020 0.984334    
STUDENT_SUBJECT_AREA_VALUE46   0.179673   0.551668   0.326 0.744658    
STUDENT_SUBJECT_AREA_VALUE48  -0.159184   0.549616  -0.290 0.772102    
STUDENT_SUBJECT_AREA_VALUE5   -1.524772   0.651268  -2.341 0.019220 *  
STUDENT_SUBJECT_AREA_VALUE52  -0.423912   0.544501  -0.779 0.436254    
STUDENT_SUBJECT_AREA_VALUE54  -0.465939   0.559786  -0.832 0.405210    
STUDENT_SUBJECT_AREA_VALUE58  -0.970780   0.545905  -1.778 0.075356 .  
STUDENT_SUBJECT_AREA_VALUE6   -1.042265   0.592845  -1.758 0.078735 .  
STUDENT_SUBJECT_AREA_VALUE62  -0.988229   0.557883  -1.771 0.076496 .  
STUDENT_SUBJECT_AREA_VALUE64  -2.723809   0.685860  -3.971 7.15e-05 ***
STUDENT_SUBJECT_AREA_VALUE72  -1.195239   0.546208  -2.188 0.028651 *  
STUDENT_SUBJECT_AREA_VALUE76  -0.523986   0.563750  -0.929 0.352648    
STUDENT_SUBJECT_AREA_VALUE8    0.368579   1.075075   0.343 0.731718    
STUDENT_SUBJECT_AREA_VALUE81  -1.162951   0.547930  -2.122 0.033801 *  
STUDENT_SUBJECT_AREA_VALUE84  -1.123646   0.681346  -1.649 0.099116 .  
STUDENT_SUBJECT_AREA_VALUE85  -1.171075   0.645957  -1.813 0.069842 .  
STUDENT_SUBJECT_AREA_VALUE86   1.129102   1.224562   0.922 0.356505    
STUDENT_SUBJECT_AREA_VALUE90 -10.391941  84.478372  -0.123 0.902097    
STUDENT_SUBJECT_AREA_VALUE99   0.229852   0.581121   0.396 0.692450    
STUDENT_GENDER_CDEM            0.150036   0.023913   6.274 3.51e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 63615  on 48489  degrees of freedom
Residual deviance: 51770  on 48423  degrees of freedom
AIC: 51904

The rightmost column shows the p-value and a significance indicator for each independent variable. Age has no statistically significant effect on the outcome, since its p-value (0.41) is well above 0.05. Gender, however, has a very small p-value and is therefore a significant predictor. For study area, most categories are not significant, with a few exceptions such as area 64, so this variable plays only a minor role in estimating the host country. Finally, most nationality categories are significant, so nationality is clearly associated with the dependent variable.
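Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below uses the estimate and standard error reported for STUDENT_GENDER_CDEM in the summary and adds an approximate 95% Wald interval. Which host country counts as "success" depends on the factor level order, which the report does not state.

```r
# Values copied from the STUDENT_GENDER_CDEM row of the summary
gender_coef <- 0.150036
gender_se <- 0.023913

# Odds ratio for male versus female students
odds_ratio <- exp(gender_coef)

# Approximate 95% Wald confidence interval on the odds-ratio scale
ci <- exp(gender_coef + c(-1, 1) * qnorm(0.975) * gender_se)
```

The odds ratio is about 1.16, so the odds of the outcome coded as success are roughly 16% higher for male students, and the interval lies entirely above 1, in line with the small p-value.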

R squared is the usual measure of the share of variance explained by a model. A logistic regression model is fitted by maximum likelihood rather than by minimizing squared error, so R squared is not reported in the summary. However, we can use the following formula to get a sense of the explained variance:

1 - (model$deviance / model$null.deviance)

Subtracting the ratio of residual deviance to null deviance from 1 gives a pseudo R squared, which in our case is around 18.6%. We can conclude that the model explains the variation in the outcome poorly.
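The figure can be checked directly from the two deviances printed in the summary, without refitting the model. The variable names below are chosen for this example.

```r
# Deviances as printed in the model summary
null_deviance <- 63615
residual_deviance <- 51770

# Deviance-based pseudo R squared
pseudo_r2 <- 1 - residual_deviance / null_deviance
round(pseudo_r2, 3)
```

This evaluates to about 0.186.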

Prediction

Reviewing the results of our prediction approach, we have to state that the M5P algorithm is not suited to this classification problem. With 27.04% accuracy, less than a third of all instances were classified correctly. The mean absolute error points in the same direction.

=== Summary ===

Correlation coefficient                  0.1647
Mean absolute error                      8.2327
Root mean squared error                  9.5352
Relative absolute error                 97.0758 %
Root relative squared error             98.6351 %
Total Number of Instances                76756     

This implies that, on average, our model’s predictions differed from the true HOST_INSTITUTION_COUNTRY_CDE value by about 8.23. Given this result, the approach does not seem appropriate. Alternatively, it may be that predicting a student’s destination from the given dataset is not a reasonable task.

Conclusion

This project gave us insight into the composition of the ERASMUS+ process. While working with the data we learned how this exchange program looks from an administrative point of view. We first looked at the different variables and encountered information about funds, personal details and locations. We saw that selecting an ERASMUS destination is a complex task. Nevertheless, we found that students from certain countries tend to go to specific destinations, probably for financial reasons. Knowing a student's nationality, we can make an informed guess about their destination. We also found that ERASMUS attracts students of all ages, including one who was 93 years old.

For future research we suggest combining datasets from multiple years to see how preferences change and how the whole program evolves. We also propose applying different machine learning algorithms to obtain better predictions of the dependent variable. Finally, merging multiple datasets would show which universities are doing a good job of increasing their popularity over the years, so that others can learn from their progress.